{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6b82288e",
   "metadata": {},
   "source": [
    "# Understanding the basic structure of RAG\n",
    "\n",
    "- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus) \n",
    "- Peer Review: \n",
    "- Proofread : [BokyungisaGod](https://github.com/BokyungisaGod)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/12-RAG/01-RAG-Basic-PDF.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/12-RAG/01-RAG-Basic-PDF.ipynb)\n",
    "## Overview\n",
    "\n",
    "### 1. Pre-processing - Steps 1 to 4\n",
    "![rag-1.png](./assets/12-rag-rag-basic-pdf-rag-process-01.png)\n",
    "\n",
    "The pre-processing stage involves four steps to load, split, embed, and store documents into a Vector DB (database).\n",
    "\n",
    "- **Step 1: Document Load** : Load the document content.  \n",
    "- **Step 2: Text Split** : Split the document into chunks based on specific criteria.  \n",
    "- **Step 3: Embedding** : Generate embeddings for the chunks and prepare them for storage.  \n",
    "- **Step 4: Vector DB Storage** : Store the embedded chunks in the database.  \n",
    "\n",
    "### 2. RAG Execution (RunTime) - Steps 5 to 8\n",
    "![rag-2.png](./assets/12-rag-rag-basic-pdf-rag-process-02.png)\n",
    "\n",
    "- **Step 5: Retriever** : Define a retriever to fetch results from the database based on the input query. Retrievers use search algorithms and are categorized as **dense** or **sparse**:\n",
    "  - **Dense** : Similarity-based search.\n",
    "  - **Sparse** : Keyword-based search.\n",
    "\n",
    "- **Step 6: Prompt** : Create a prompt for executing RAG. The ```context``` in the prompt includes content retrieved from the document. Through prompt engineering, you can specify the format of the answer.  \n",
    "\n",
    "- **Step 7: LLM** : Define the language model (e.g., GPT-3.5, GPT-4, Claude).  \n",
    "\n",
    "- **Step 8: Chain** : Create a chain that connects the prompt, LLM, and output.  \n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [RAG Basic Pipeline](#rag-basic-pipeline)\n",
    "- [Complete Code](#complete-code)\n",
    "\n",
    "### References\n",
    "\n",
    "- [LangChain How-to guides : Q&A with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)\n",
    "------"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cf05522",
   "metadata": {},
   "source": [
    "Document Used for Practice\n",
    "A European Approach to Artificial Intelligence - A Policy Perspective\n",
    "\n",
    "- Author: EIT Digital and 5 EIT KICs (EIT Manufacturing, EIT Urban Mobility, EIT Health, EIT Climate-KIC, EIT Digital)\n",
    "- Link: https://eit.europa.eu/sites/default/files/eit-digital-artificial-intelligence-report.pdf\n",
    "- File Name: A European Approach to Artificial Intelligence - A Policy Perspective.pdf\n",
    "\n",
    " _Please copy the downloaded file to the data folder for practice._ "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ff7f057",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    " **[Note]** \n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [ ```langchain-opentutorial``` ](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b05c4862",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a224fd32",
   "metadata": {},
   "source": [
    "Set the API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "68341201",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [   \n",
    "        \"langchain_community\",\n",
    "        \"langsmith\"\n",
    "        \"langchain\"\n",
    "        \"langchain_text_splitters\"\n",
    "        \"langchain_core\"\n",
    "        \"langchain_openai\"\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2059bd47",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Environment variables have been set successfully.\n"
     ]
    }
   ],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {   \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"RAG-Basic-PDF\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "418ab505",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Configuration file for managing API keys as environment variables\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Load API key information\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b0d050a",
   "metadata": {},
   "source": [
    "## RAG Basic Pipeline\n",
    "\n",
    "Below is the skeleton code for understanding the basic structure of RAG (Retrieval Augmented Generation).\n",
    "\n",
    "The content of each module can be adjusted to fit specific scenarios, allowing for iterative improvement of the structure to suit the documents.\n",
    "\n",
    "(Different options or new techniques can be applied at each step.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f3d1b0fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "from langchain_community.document_loaders import PyMuPDFLoader\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "377894c9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of pages in the document: 24\n"
     ]
    }
   ],
   "source": [
    "# Step 1: Load Documents\n",
    "loader = PyMuPDFLoader(\"./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf\")\n",
    "docs = loader.load()\n",
    "print(f\"Number of pages in the document: {len(docs)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b34f4fe",
   "metadata": {},
   "source": [
    "Print the content of the page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ddf0d7c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n",
      "11\n",
      "GENERIC \n",
      "There are five issues that, though from slightly different angles, \n",
      "are considered strategic and a potential source of barriers and \n",
      "bottlenecks: data, organisation, human capital, trust, markets. The \n",
      "availability and quality of data, as well as data governance are of \n",
      "strategic importance. Strictly technical issues (i.e., inter-operabi-\n",
      "lity, standardisation) are mostly being solved, whereas internal and \n",
      "external data governance still restrain the full potential of AI Inno-\n",
      "vation. Organisational resources and, also, cognitive and cultural \n",
      "routines are a challenge to cope with for full deployment. On the \n",
      "one hand, there is the issue of the needed investments when evi-\n",
      "dence on return is not yet consolidated. On the other hand, equally \n",
      "important, are cultural conservatism and misalignment between \n",
      "analytical and business objectives. Skills shortages are a main \n",
      "bottleneck in all the four sectors considered in this report where \n",
      "upskilling, reskilling, and new skills creation are considered crucial. \n",
      "For many organisations data scientists are either too expensive or \n",
      "difficult to recruit and retain. There is still a need to build trust on \n",
      "AI, amongst both the final users (consumers, patients, etc.) and \n",
      "intermediate / professional users (i.e., healthcare professionals). \n",
      "This is a matter of privacy and personal data protection, of building \n",
      "a positive institutional narrative backed by mitigation strategies, \n",
      "and of cumulating evidence showing that benefits outweigh costs \n",
      "and risks. As demand for AI innovations is still limited (in many \n",
      "sectors a ‘wait and see’ approach is prevalent) this does not fa-\n",
      "vour the emergence of a competitive supply side. Few start-ups \n",
      "manage to scale up, and many are subsequently bought by a few \n",
      "large dominant players. As a result of the fact that these issues \n",
      "have not yet been solved on a large scale, using a 5 levels scale \n",
      "GENERIC AND CONTEXT DEPENDING \n",
      "OPPORTUNITIES AND POLICY LEVERS\n",
      "of deployment maturity (1= not started; 2= experimentation; 3= \n",
      "practitioner use; 4= professional use; and 5= AI driven companies), \n",
      "it seems that, in all four vertical domains considered, adoption re-\n",
      "mains at level 2 (experimentation) or 3 (practitioner use), with only \n",
      "few advanced exceptions mostly in Manufacturing and Health-\n",
      "care. In Urban Mobility, as phrased by interviewed experts, only \n",
      "lightweight AI applications are widely adopted, whereas in the Cli-\n",
      "mate domain we are just at the level of early predictive models. \n",
      "Considering the different areas of AI applications, regardless of the \n",
      "domains, the most adopted ones include predictive maintenance, \n",
      "chatbots, voice/text recognition, NPL, imagining, computer vision \n",
      "and predictive analytics.\n",
      "MANUFACTURING \n",
      "The manufacturing sector is one of the leaders in application of \n",
      "AI technologies; from significant cuts in unplanned downtime to \n",
      "better designed products, manufacturers are applying AI-powe-\n",
      "red analytics to data to improve efficiency, product quality and \n",
      "the safety of employees. The key application of AI is certainly in \n",
      "predictive maintenance. Yet, the more radical transformation of \n",
      "manufacturing will occur when manufacturers will move to ‘ser-\n",
      "vice-based’ managing of the full lifecycle from consumers pre-\n",
      "ferences to production and delivery (i.e., the Industry 4.0 vision). \n",
      "Manufacturing companies are investing into this vision and are \n",
      "keen to protect their intellectual property generated from such in-\n",
      "vestments. So, there is a concern that a potential new legislative \n",
      "action by the European Commission, which would follow the prin-\n",
      "ciples of the GDPR and the requirements of the White Paper, may \n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(docs[10].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7e2963b",
   "metadata": {},
   "source": [
    "Check the metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6d6b05fd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': None,\n",
       " 'metadata': {'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',\n",
       "  'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf',\n",
       "  'page': 10,\n",
       "  'total_pages': 24,\n",
       "  'format': 'PDF 1.4',\n",
       "  'title': '',\n",
       "  'author': '',\n",
       "  'subject': '',\n",
       "  'keywords': '',\n",
       "  'creator': 'Adobe InDesign 17.3 (Macintosh)',\n",
       "  'producer': 'Adobe PDF Library 16.0.7',\n",
       "  'creationDate': \"D:20220823105611+02'00'\",\n",
       "  'modDate': \"D:20220823105617+02'00'\",\n",
       "  'trapped': ''},\n",
       " 'page_content': 'A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\\n11\\nGENERIC \\nThere are five issues that, though from slightly different angles, \\nare considered strategic and a potential source of barriers and \\nbottlenecks: data, organisation, human capital, trust, markets. The \\navailability and quality of data, as well as data governance are of \\nstrategic importance. Strictly technical issues (i.e., inter-operabi-\\nlity, standardisation) are mostly being solved, whereas internal and \\nexternal data governance still restrain the full potential of AI Inno-\\nvation. Organisational resources and, also, cognitive and cultural \\nroutines are a challenge to cope with for full deployment. On the \\none hand, there is the issue of the needed investments when evi-\\ndence on return is not yet consolidated. On the other hand, equally \\nimportant, are cultural conservatism and misalignment between \\nanalytical and business objectives. Skills shortages are a main \\nbottleneck in all the four sectors considered in this report where \\nupskilling, reskilling, and new skills creation are considered crucial. \\nFor many organisations data scientists are either too expensive or \\ndifficult to recruit and retain. There is still a need to build trust on \\nAI, amongst both the final users (consumers, patients, etc.) and \\nintermediate / professional users (i.e., healthcare professionals). \\nThis is a matter of privacy and personal data protection, of building \\na positive institutional narrative backed by mitigation strategies, \\nand of cumulating evidence showing that benefits outweigh costs \\nand risks. As demand for AI innovations is still limited (in many \\nsectors a ‘wait and see’ approach is prevalent) this does not fa-\\nvour the emergence of a competitive supply side. Few start-ups \\nmanage to scale up, and many are subsequently bought by a few \\nlarge dominant players. As a result of the fact that these issues \\nhave not yet been solved on a large scale, using a 5 levels scale \\nGENERIC AND CONTEXT DEPENDING \\nOPPORTUNITIES AND POLICY LEVERS\\nof deployment maturity (1= not started; 2= experimentation; 3= \\npractitioner use; 4= professional use; and 5= AI driven companies), \\nit seems that, in all four vertical domains considered, adoption re-\\nmains at level 2 (experimentation) or 3 (practitioner use), with only \\nfew advanced exceptions mostly in Manufacturing and Health-\\ncare. In Urban Mobility, as phrased by interviewed experts, only \\nlightweight AI applications are widely adopted, whereas in the Cli-\\nmate domain we are just at the level of early predictive models. \\nConsidering the different areas of AI applications, regardless of the \\ndomains, the most adopted ones include predictive maintenance, \\nchatbots, voice/text recognition, NPL, imagining, computer vision \\nand predictive analytics.\\nMANUFACTURING \\nThe manufacturing sector is one of the leaders in application of \\nAI technologies; from significant cuts in unplanned downtime to \\nbetter designed products, manufacturers are applying AI-powe-\\nred analytics to data to improve efficiency, product quality and \\nthe safety of employees. The key application of AI is certainly in \\npredictive maintenance. Yet, the more radical transformation of \\nmanufacturing will occur when manufacturers will move to ‘ser-\\nvice-based’ managing of the full lifecycle from consumers pre-\\nferences to production and delivery (i.e., the Industry 4.0 vision). \\nManufacturing companies are investing into this vision and are \\nkeen to protect their intellectual property generated from such in-\\nvestments. So, there is a concern that a potential new legislative \\naction by the European Commission, which would follow the prin-\\nciples of the GDPR and the requirements of the White Paper, may \\n',\n",
       " 'type': 'Document'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[10].__dict__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1b52f26a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of split chunks: 163\n"
     ]
    }
   ],
   "source": [
    "# Step 2: Split Documents\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)\n",
    "split_documents = text_splitter.split_documents(docs)\n",
    "print(f\"Number of split chunks: {len(split_documents)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "795cfec7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Generate Embeddings\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "82f47754",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 4: Create and Save the Database\n",
    "# Create a vector store.\n",
    "vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "34dd3019",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n",
      "14\n",
      "Table 3: Urban Mobility: concerns, opportunities and policy levers.\n",
      "URBAN MOBILITY\n",
      "The adoption of AI in the management of urban mobility systems \n",
      "brings different sets of benefits for private stakeholders (citizens, \n",
      "private companies) and public stakeholders (municipalities, trans-\n",
      "portation service providers). So far only light-weight task specific \n",
      "AI applications have been deployed (i.e., intelligent routing, sharing\n",
      "A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\n",
      "15\n",
      "One of the most interesting development close to scale up is the \n",
      "creation of platforms, which are fed by all different data sources \n",
      "of transport services (both private and public) and provide the ci-\n",
      "tizens a targeted recommendation on the best way to travel, also \n",
      "based on personal preferences and characteristics. \n",
      "Urban Mobility should focus on what is already potentially avai-\n",
      "apps, predictive models based on citizens’ location and personal \n",
      "data). On the other hand, the most advanced and transformative \n",
      "AI applications, such as autonomous vehicles are lagging behind, \n",
      "especially if compared to the US or China. The key challenge for AI \n",
      "deployment in Urban Mobility sector is the need to find a common \n",
      "win-win business model across a diversity of public and private \n",
      "sector players with different organisational objectives, cultures,\n",
      "care. In Urban Mobility, as phrased by interviewed experts, only \n",
      "lightweight AI applications are widely adopted, whereas in the Cli-\n",
      "mate domain we are just at the level of early predictive models. \n",
      "Considering the different areas of AI applications, regardless of the \n",
      "domains, the most adopted ones include predictive maintenance, \n",
      "chatbots, voice/text recognition, NPL, imagining, computer vision \n",
      "and predictive analytics.\n",
      "MANUFACTURING\n"
     ]
    }
   ],
   "source": [
    "for doc in vectorstore.similarity_search(\"URBAN MOBILITY\"):\n",
    "    print(doc.page_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "838f7729",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 5: Create Retriever\n",
    "# Search and retrieve information contained in the documents.\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f29da7b4",
   "metadata": {},
   "source": [
    "Send a query to the retriever and check the resulting chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "16c0ad82",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(id='fdfb5187-141a-4693-b5d0-e1066b0ef27f', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 9, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': \"D:20220823105611+02'00'\", 'modDate': \"D:20220823105617+02'00'\", 'trapped': ''}, page_content='A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\\n10\\nrequirements becomes mandatory in all sectors and create bar-\\nriers especially for innovators and SMEs. Public procurement ‘data \\nsovereignty clauses’ induce large players to withdraw from AI for \\nurban ecosystems. Strict liability sanctions block AI in healthcare, \\nwhile limiting space of self-driving experimentation. The support \\nmeasures to boost European AI are not sufficient to offset the'),\n",
       " Document(id='5aada0ed-9a07-4c9b-a290-d24856d64494', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 22, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': \"D:20220823105611+02'00'\", 'modDate': \"D:20220823105617+02'00'\", 'trapped': ''}, page_content='A EUROPEAN APPROACH TO ARTIFICIAL INTELLIGENCE - A POLICY PERSPECTIVE\\n23\\nACKNOWLEDGEMENTS\\nIn the context of their strategic innovation activities for Europe, five EIT Knowledge and Innovation Communities (EIT Manufacturing, EIT Ur-\\nban Mobility, EIT Health, EIT Climate-KIC, and EIT Digital as coordinator) decided to launch a study that identifies general and sector specific \\nconcerns and opportunities for the deployment of AI in Europe.'),\n",
       " Document(id='37657411-894d-4e9c-975b-d1a99ef0e20a', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 21, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': \"D:20220823105611+02'00'\", 'modDate': \"D:20220823105617+02'00'\", 'trapped': ''}, page_content='sion/presscorner/detail/en/IP_18_6689.\\nEuropean Commission. (2020a). White Paper on Artificial Intelligence. A European Ap-\\nproach to Excellence and Trust. COM(2020) 65 final, Brussels: European Commission. \\nEuropean Commission. (2020b). A European Strategy to Data. COM(2020) 66 final, Brus-\\nsels: European Commission. \\nEuropean Parliament. (2020). Digital sovereignty for Europe. Brussels: European Parliament \\n(retrieved from: https://www.europarl.europa.eu/RegData/etudes/BRIE/2020/651992/'),\n",
       " Document(id='1aa90862-fe35-4797-ad6a-225f9da47824', metadata={'source': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'file_path': './data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf', 'page': 5, 'total_pages': 24, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign 17.3 (Macintosh)', 'producer': 'Adobe PDF Library 16.0.7', 'creationDate': \"D:20220823105611+02'00'\", 'modDate': \"D:20220823105617+02'00'\", 'trapped': ''}, page_content='ries and is the result of a combined effort from five EIT KICs (EIT \\nManufacturing, EIT Urban Mobility, EIT Health, EIT Climate-KIC, \\nand EIT Digital as coordinator). It identifies both general and sec-\\ntor specific concerns and opportunities for the further deployment \\nof AI in Europe. Starting from the background and policy context \\noutlined in this introduction, some critical aspects of AI are fur-\\nther discussed in Section 2. Next, in Section 3 four scenarios')]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.invoke(\"What is the phased implementation timeline for the EU AI Act?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3bb3e26f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 6: Create Prompt\n",
    "prompt = PromptTemplate.from_template(\n",
    "    \"\"\"You are an assistant for question-answering tasks. \n",
    "Use the following pieces of retrieved context to answer the question. \n",
    "If you don't know the answer, just say that you don't know. \n",
    "\n",
    "#Context: \n",
    "{context}\n",
    "\n",
    "#Question:\n",
    "{question}\n",
    "\n",
    "#Answer:\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "669ed5b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 7: Setup LLM\n",
    "llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "3113bc05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 8: Create Chain\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e79f4aeb",
   "metadata": {},
   "source": [
    "Input a query (question) into the created chain and execute it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "50d6b7f0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The application of AI in healthcare has so far been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.\n"
     ]
    }
   ],
   "source": [
    "# Run Chain\n",
    "# Input a query about the document and print the response.\n",
    "question = \"Where has the application of AI in healthcare been confined to so far?\"\n",
    "response = chain.invoke(question)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8444e43",
   "metadata": {},
   "source": [
    "## Complete Code\n",
    "This is a combined code that integrates steps 1 through 8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "adc45dbf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "from langchain_community.document_loaders import PyMuPDFLoader\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "\n",
    "# Step 1: Load Documents\n",
    "loader = PyMuPDFLoader(\"./data/A European Approach to Artificial Intelligence - A Policy Perspective.pdf\")\n",
    "docs = loader.load()\n",
    "\n",
    "# Step 2: Split Documents\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)\n",
    "split_documents = text_splitter.split_documents(docs)\n",
    "\n",
    "# Step 3: Generate Embeddings\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# Step 4: Create and Save the Database\n",
    "# Create a vector store.\n",
    "vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)\n",
    "\n",
    "# Step 5: Create Retriever\n",
    "# Search and retrieve information contained in the documents.\n",
    "retriever = vectorstore.as_retriever()\n",
    "\n",
    "# Step 6: Create Prompt\n",
    "prompt = PromptTemplate.from_template(\n",
    "    \"\"\"You are an assistant for question-answering tasks. \n",
    "Use the following pieces of retrieved context to answer the question. \n",
    "If you don't know the answer, just say that you don't know. \n",
    "\n",
    "#Context: \n",
    "{context}\n",
    "\n",
    "#Question:\n",
    "{question}\n",
    "\n",
    "#Answer:\"\"\"\n",
    ")\n",
    "\n",
    "# Step 7: Load LLM\n",
    "llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)\n",
    "\n",
    "# Step 8: Create Chain\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "c5986cab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The application of AI in healthcare has so far been confined to administrative tasks, such as Natural Language Processing to extract information from clinical notes or predictive scheduling of visits, and diagnostic tasks, including machine and deep learning applied to imaging in radiology, pathology, and dermatology.\n"
     ]
    }
   ],
   "source": [
    "# Run Chain\n",
    "# Input a query about the document and print the response.\n",
    "question = \"Where has the application of AI in healthcare been confined to so far?\"\n",
    "response = chain.invoke(question)\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-kr-bpXWMSjn-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
